This module extends code contained in Coronavirus_Statistics_v004.Rmd to include sourcing of all key functions and parameters. This file includes the latest code for analyzing all-cause death data from CDC Weekly Deaths by Jurisdiction. CDC maintains data on deaths by week, age cohort, and state in the US. Downloaded data are unique by state, epidemiological week, year, age, and type (actual vs. predicted/projected).
These data are known to have a lag between death and reporting, and the CDC back-correct to report deaths at the time the death occurred even if the death is reported in following weeks. This means totals for recent weeks tend to run low (lag), and the CDC run a projection of the expected total number of deaths given the historical lag times. Per other analysts on the internet, there is currently significant supra-lag, with lag times much longer than historical averages causing CDC projected deaths for recent weeks to be low.
The code leverages tidyverse and sourced functions throughout:
# All functions assume that tidyverse and its components are loaded and available
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# If the same function is in both files, use the version from the more specific source
source("./Generic_Added_Utility_Functions_202105_v001.R")
source("./Coronavirus_CDC_Excess_Functions_v001.R")
The main function is readRunCDCAllCause(), which performs multiple tasks:
STEP 0: Optionally, downloads the latest data file from CDC STEP 1: Reads and processes a data file has been downloaded from CDC to local
STEP 2: Extract relevant data from a processed state-level COVID Tracking Project list
STEP 3: Basic plots of the CDC data
STEP 4: Basic excess-deaths analysis
STEP 5: Create cluster-level aggregate plots
STEP 6: Create state-level aggregate plots
STEP 7: Create age-cohort aggregate plots
STEP 8: Returns a list of key data frames, modeling objects, named cluster vectors, etc.
The functions are tested on previously downloaded data:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210623.csv"
cdcList_20210703 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=17,
lst=readFromRDS("cdc_daily_210528"),
dlData=FALSE,
stateNoCheck=c("NC"),
pdfCluster=TRUE,
pdfAge=TRUE
)
##
## Parameter cvDeathThru has been set as: 2021-05-01
##
##
## *** Data suppression checks ***
## # A tibble: 2 x 6
## noCheck state problem curWeek n deaths
## <lgl> <chr> <lgl> <lgl> <int> <dbl>
## 1 TRUE NC TRUE FALSE 72 NA
## 2 TRUE NC TRUE TRUE 6 NA
## # A tibble: 2 x 3
## noCheck curWeek n
## <lgl> <lgl> <int>
## 1 TRUE FALSE 72
## 2 TRUE TRUE 6
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 91,537
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 10735 0 369164
## 2 25-44 years 13656 0 902390
## 3 45-64 years 16793 0 3549786
## 4 65-74 years 16783 0 3558139
## 5 75-84 years 16790 0 4401133
## 6 85 years and older 16780 0 5681860
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14364 0 2691180
## 2 2015-2019 2016 Predicted (weighted) 14445 0 2723236
## 3 2015-2019 2017 Predicted (weighted) 14404 0 2801986
## 4 2015-2019 2018 Predicted (weighted) 14400 0 2830372
## 5 2015-2019 2019 Predicted (weighted) 14415 0 2844025
## 6 2020 2020 Predicted (weighted) 14837 0 3433405
## 7 2021 2021 Predicted (weighted) 4672 0 1138268
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72028 0 13890799
## 2 2020 <NA> 14837 0 3433405
## 3 2021 <NA> 4672 0 1138268
##
##
## Checking variable combination: period Note
## # A tibble: 9 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72028 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only ~ 13194 0 2.96e6
## 3 2020 Data in recent weeks are incomplete. Only ~ 531 0 2.31e5
## 4 2020 Weighted numbers of deaths are 20% or more~ 280 0 6.00e4
## 5 2020 Weights may be too low to account for unde~ 18 0 9.85e3
## 6 2020 <NA> 814 0 1.69e5
## 7 2021 Data in recent weeks are incomplete. Only ~ 4469 0 1.10e6
## 8 2021 Data in recent weeks are incomplete. Only ~ 14 0 9.65e2
## 9 2021 Data in recent weeks are incomplete. Only ~ 189 0 3.58e4
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w17.pdf
##
## Returning plot outputs to the main log file
## Joining, by = "state"
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w17.pdf
##
## Returning plot outputs to the main log file
The latest data are downloaded and processed:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210708.csv"
cdcList_20210708 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=22,
lst=readFromRDS("cdc_daily_210708"),
stateNoCheck=c("NC", "AK", "WV"),
pdfCluster=TRUE,
pdfAge=TRUE
)
##
## Parameter cvDeathThru has been set as: 2021-06-05
##
##
## *** Data suppression checks ***
## # A tibble: 4 x 6
## noCheck state problem curWeek n deaths
## <lgl> <chr> <lgl> <lgl> <int> <dbl>
## 1 TRUE AK TRUE FALSE 2 NA
## 2 TRUE NC TRUE FALSE 102 NA
## 3 TRUE NC TRUE TRUE 6 NA
## 4 TRUE WV TRUE TRUE 2 NA
## # A tibble: 2 x 3
## noCheck curWeek n
## <lgl> <lgl> <int>
## 1 TRUE FALSE 104
## 2 TRUE TRUE 8
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 92,880
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 10890 0 374959
## 2 25-44 years 13868 0 919211
## 3 45-64 years 17038 0 3605423
## 4 65-74 years 17027 0 3615820
## 5 75-84 years 17033 0 4467166
## 6 85 years and older 17024 0 5757892
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14364 0 2691176
## 2 2015-2019 2016 Predicted (weighted) 14443 0 2723213
## 3 2015-2019 2017 Predicted (weighted) 14408 0 2802027
## 4 2015-2019 2018 Predicted (weighted) 14400 0 2830376
## 5 2015-2019 2019 Predicted (weighted) 14414 0 2844003
## 6 2020 2020 Predicted (weighted) 14838 0 3432903
## 7 2021 2021 Predicted (weighted) 6013 0 1416773
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72029 0 13890795
## 2 2020 <NA> 14838 0 3432903
## 3 2021 <NA> 6013 0 1416773
##
##
## Checking variable combination: period Note
## # A tibble: 10 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72029 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only~ 13459 0 3.04e6
## 3 2020 Data in recent weeks are incomplete. Only~ 5 0 1.24e2
## 4 2020 Data in recent weeks are incomplete. Only~ 262 0 1.57e5
## 5 2020 Weighted numbers of deaths are 20% or mor~ 280 0 6.00e4
## 6 2020 Weights may be too low to account for und~ 10 0 5.95e3
## 7 2020 <NA> 822 0 1.73e5
## 8 2021 Data in recent weeks are incomplete. Only~ 5631 0 1.34e6
## 9 2021 Data in recent weeks are incomplete. Only~ 24 0 2.00e3
## 10 2021 Data in recent weeks are incomplete. Only~ 358 0 7.15e4
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w22.pdf
##
## Returning plot outputs to the main log file
## Joining, by = "state"
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w22.pdf
##
## Returning plot outputs to the main log file
saveToRDS(cdcList_20210708)
The function readProcessCDC() is updated to allow for more control in zeroing out (rather than erroring) where there is a small number of data suppression:
# Function to check for CDC excess suppression
checkCDCSuppression <- function(df, stateNoCheck, errTotAllowed=20, errMaxAllowed=round(errTotAllowed/2)) {
# Categorize the potential issues in the file (note to suppress or NA deaths)
checkProblems <- df %>%
mutate(problem=(!is.na(Suppress) | is.na(deaths)),
noCheck=state %in% all_of(stateNoCheck)
)
# Print a list of the problems, excluding those in stateNoCheck
cat("\nRows in states to be checked that have NA deaths or a note for suppression:\n")
checkProblems %>%
filter(problem, !noCheck) %>%
arrange(desc(year), desc(week)) %>%
select(state, weekEnding, year, week, age, Suppress, deaths) %>%
as.data.frame() %>%
print()
# Summarize the problems
cat("\n\nProblems by state:\n")
checkProblems %>%
group_by(noCheck, state, problem) %>%
summarize(n=n(), deaths=specNA(sum)(deaths), .groups="drop") %>%
filter(problem) %>%
print()
# Assess the amount of error
errorState <- checkProblems %>%
filter(problem, !noCheck) %>%
count(state)
# Error out if threshold for error by state OR total errors exceeded
errMax <- errorState %>% pull(n) %>% max()
errTot <- errorState %>% pull(n) %>% sum()
cat("\n\nThere are", errTot, "rows with errors; maximum for any given state is", errMax, "errors\n")
if ((errTot > errTotAllowed) | (errMax > errMaxAllowed)) {
stop("\nToo many errors; thresholds are ", errTotAllowed, " total and ", errMaxAllowed, " maximum\n")
}
}
plotQCReadProcessCDC <- function(df,
ckCombos=list(c("age"), c("period", "year", "Type"),
c("period", "Suppress"), c("period", "Note")
)
) {
# Create dataset for analysis
df <- df %>%
mutate(n=1, n_deaths_na=ifelse(is.na(deaths), 1, 0))
# Check control totals by specified combinaions
purrr::walk(ckCombos, .f=function(x) {
cat("\n\nChecking variable combination:", x, "\n")
checkControl(df, groupBy=x, useVars=c("n", "n_deaths_na", "deaths"), fn=specNA(sum))
}
)
# Plot deaths by state
p1 <- checkControl(df,
groupBy=c("state"),
useVars=c("deaths"),
fn=specNA(sum),
printControls=FALSE,
pivotData=FALSE
) %>%
ggplot(aes(x=fct_reorder(state, deaths), y=deaths)) +
geom_col(fill="lightblue") +
geom_text(aes(y=deaths, label=paste0(round(deaths/1000), "k")), hjust=0, size=3) +
coord_flip() +
labs(y="Total deaths", x=NULL, title="Total deaths by state in all years in processed file")
print(p1)
# Plot deaths by week/year
p2 <- checkControl(df,
groupBy=c("year", "week"),
useVars=c("deaths"),
fn=specNA(sum),
printControls=FALSE,
pivotData=FALSE
) %>%
ggplot(aes(x=week, y=deaths)) +
geom_line(aes(group=year, color=year)) +
labs(title="Deaths by year and epidemiological week", x="Epi week", y="US deaths") +
scale_color_discrete("Year") +
lims(y=c(0, NA))
print(p2)
}
# Function to read and process raw CDC all-cause deaths data
readProcessCDC <- function(fName,
weekThru,
periodKeep=cdcExcessParams$periodKeep,
fDir="./RInputFiles/Coronavirus/",
col_types=cdcExcessParams$colTypes,
renameVars=cdcExcessParams$remapVars,
maxSuppressAllowed=20,
stateNoCheck=c()
) {
# FUNCTION ARGUMENTS:
# fName: name of the downloaded CDC data file
# weekThru: any record where week is less than or equal to weekThru will be kept
# periodKeep: any record where period is in periodKeep will be kept
# fDir: directory name for the downloaded CDC data file
# col_types: variable type by column in the CDC data (passed to readr::read_csv())
# renameVars: named vector for variable renaming of type c("Existing Name"="New Name")
# maxSuppressAllowed: maximum number of data suppressions (must be in current week/year) to avoid error
# stateNoCheck: vector of states that do NOT have suppression errors thrown
# STEP 1: Read the CSV data
cdcRaw <- fileRead(paste0(fDir, fName), col_types=col_types)
# glimpse(cdcRaw)
# STEP 2: Rename the variables for easier interpretation
cdcRenamed <- cdcRaw %>%
colRenamer(vecRename=renameVars) %>%
colMutater(selfList=list("weekEnding"=lubridate::mdy))
# glimpse(cdcRenamed)
# STEP 3: Convert to factored data
cdcFactored <- cdcRenamed %>%
colMutater(selfList=list("age"=factor), levels=cdcExcessParams$ageLevels) %>%
colMutater(selfList=list("period"=factor), levels=cdcExcessParams$periodLevels) %>%
colMutater(selfList=list("year"=factor), levels=cdcExcessParams$yearLevels)
# glimpse(cdcFactored)
# STEP 4: Filter the data to include only weighted deaths and only through the desired time period
cdcFiltered <- cdcFactored %>%
rowFilter(lstFilter=list("Type"="Predicted (weighted)")) %>%
filter(period %in% all_of(periodKeep) | week <= weekThru)
# glimpse(cdcFiltered)
# STEP 4a: Check that all suppressed data and NA deaths have been eliminated
cat("\n\n *** Data suppression checks *** \n")
checkCDCSuppression(cdcFiltered, stateNoCheck=stateNoCheck, errTotAllowed=maxSuppressAllowed)
cat("\n\nData suppression checks passed\n\n")
# STEP 5: Remove any NA death fields, delete the US record, convert YC to be part of NY
cdcProcessed <- cdcFiltered %>%
rowFilter(lstExclude=list("state"=c("US", "PR"), "deaths"=c(NA))) %>%
mutate(state=ifelse(state=="YC", "NY", state),
fullState=ifelse(state %in% c("NY", "YC"), "New York State (NY plus YC)", fullState)
) %>%
group_by(fullState, weekEnding, state, year, week, age, period, Type, Suppress) %>%
arrange(!is.na(Note)) %>%
summarize(n=n(), deaths=sum(deaths), Note=first(Note), .groups="drop") %>%
ungroup() %>%
checkUniqueRows(uniqueBy=c("state", "year", "week", "age"))
glimpse(cdcProcessed)
# STEP 5a: Check control levels for key variables in processed file
cat("\nCheck Control Levels and Record Counts for Processed Data:\n")
plotQCReadProcessCDC(cdcProcessed)
# STEP 6: Return the processed data file
cdcProcessed
}
The data are processed using the updated function:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210708.csv"
cdcList_20210708_v2 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=23,
lst=readFromRDS("cdc_daily_210708"),
stateNoCheck=c("NC"),
pdfCluster=TRUE,
pdfAge=TRUE
)
##
## Parameter cvDeathThru has been set as: 2021-06-12
##
##
## *** Data suppression checks ***
##
## Rows in states to be checked that have NA deaths or a note for suppression:
## state weekEnding year week age
## 1 CT 2021-06-12 2021 23 45-64 years
## 2 CT 2021-06-12 2021 23 65-74 years
## 3 CT 2021-06-12 2021 23 75-84 years
## 4 CT 2021-06-12 2021 23 85 years and older
## 5 DE 2021-06-12 2021 23 65-74 years
## 6 DE 2021-06-12 2021 23 75-84 years
## 7 DE 2021-06-12 2021 23 85 years and older
## 8 WV 2021-06-05 2021 22 45-64 years
## 9 WV 2021-06-05 2021 22 65-74 years
## 10 AK 2021-05-08 2021 18 45-64 years
## 11 AK 2021-05-08 2021 18 65-74 years
## Suppress deaths
## 1 Suppressed (counts highly incomplete, <50% of expected) NA
## 2 Suppressed (counts highly incomplete, <50% of expected) NA
## 3 Suppressed (counts highly incomplete, <50% of expected) NA
## 4 Suppressed (counts highly incomplete, <50% of expected) NA
## 5 Suppressed (counts highly incomplete, <50% of expected) NA
## 6 Suppressed (counts highly incomplete, <50% of expected) NA
## 7 Suppressed (counts highly incomplete, <50% of expected) NA
## 8 Suppressed (counts highly incomplete, <50% of expected) NA
## 9 Suppressed (counts highly incomplete, <50% of expected) NA
## 10 Suppressed (counts highly incomplete, <50% of expected) NA
## 11 Suppressed (counts highly incomplete, <50% of expected) NA
##
##
## Problems by state:
## # A tibble: 5 x 5
## noCheck state problem n deaths
## <lgl> <chr> <lgl> <int> <dbl>
## 1 FALSE AK TRUE 2 NA
## 2 FALSE CT TRUE 4 NA
## 3 FALSE DE TRUE 3 NA
## 4 FALSE WV TRUE 2 NA
## 5 TRUE NC TRUE 114 NA
##
##
## There are 11 rows with errors; maximum for any given state is 4 errors
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 93,132
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 10919 0 375951
## 2 25-44 years 13908 0 922283
## 3 45-64 years 17084 0 3615594
## 4 65-74 years 17072 0 3626546
## 5 75-84 years 17079 0 4479686
## 6 85 years and older 17070 0 5772387
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14364 0 2691176
## 2 2015-2019 2016 Predicted (weighted) 14443 0 2723213
## 3 2015-2019 2017 Predicted (weighted) 14408 0 2802027
## 4 2015-2019 2018 Predicted (weighted) 14400 0 2830376
## 5 2015-2019 2019 Predicted (weighted) 14414 0 2844003
## 6 2020 2020 Predicted (weighted) 14838 0 3432903
## 7 2021 2021 Predicted (weighted) 6265 0 1468749
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72029 0 13890795
## 2 2020 <NA> 14838 0 3432903
## 3 2021 <NA> 6265 0 1468749
##
##
## Checking variable combination: period Note
## # A tibble: 10 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72029 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only~ 13459 0 3.04e6
## 3 2020 Data in recent weeks are incomplete. Only~ 5 0 1.24e2
## 4 2020 Data in recent weeks are incomplete. Only~ 262 0 1.57e5
## 5 2020 Weighted numbers of deaths are 20% or mor~ 280 0 6.00e4
## 6 2020 Weights may be too low to account for und~ 10 0 5.95e3
## 7 2020 <NA> 822 0 1.73e5
## 8 2021 Data in recent weeks are incomplete. Only~ 5822 0 1.38e6
## 9 2021 Data in recent weeks are incomplete. Only~ 34 0 3.23e3
## 10 2021 Data in recent weeks are incomplete. Only~ 409 0 8.16e4
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w23.pdf
##
## Returning plot outputs to the main log file
## Joining, by = "state"
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w23.pdf
##
## Returning plot outputs to the main log file
The latest data are downloaded and processed:
cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210823.csv"
cdcList_20210823 <- readRunCDCAllCause(loc=cdcLoc,
weekThru=29,
lst=readFromRDS("cdc_daily_210815"),
stateNoCheck=c("NC", "AK", "CT"),
pdfCluster=TRUE,
pdfAge=TRUE
)
##
## Parameter cvDeathThru has been set as: 2021-07-24
##
##
## *** Data suppression checks ***
##
## Rows in states to be checked that have NA deaths or a note for suppression:
## state weekEnding year week age
## 1 NE 2021-07-24 2021 29 45-64 years
## 2 NE 2021-07-24 2021 29 65-74 years
## 3 NE 2021-07-24 2021 29 75-84 years
## 4 NE 2021-07-24 2021 29 85 years and older
## Suppress deaths
## 1 Suppressed (counts highly incomplete, <50% of expected) NA
## 2 Suppressed (counts highly incomplete, <50% of expected) NA
## 3 Suppressed (counts highly incomplete, <50% of expected) NA
## 4 Suppressed (counts highly incomplete, <50% of expected) NA
##
##
## Problems by state:
## # A tibble: 4 x 5
## noCheck state problem n deaths
## <lgl> <chr> <lgl> <int> <dbl>
## 1 FALSE NE TRUE 4 NA
## 2 TRUE AK TRUE 2 NA
## 3 TRUE CT TRUE 2 NA
## 4 TRUE NC TRUE 120 NA
##
##
## There are 4 rows with errors; maximum for any given state is 4 errors
##
##
## Data suppression checks passed
##
##
## *** File has been checked for uniqueness by: state year week age
##
## Rows: 94,758
## Columns: 12
## $ fullState <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
##
## Check Control Levels and Record Counts for Processed Data:
##
##
## Checking variable combination: age
## # A tibble: 6 x 4
## age n n_deaths_na deaths
## <fct> <dbl> <dbl> <dbl>
## 1 Under 25 years 11107 0 383113
## 2 25-44 years 14165 0 943695
## 3 45-64 years 17377 0 3682738
## 4 65-74 years 17367 0 3696383
## 5 75-84 years 17375 0 4559955
## 6 85 years and older 17367 0 5864442
##
##
## Checking variable combination: period year Type
## # A tibble: 7 x 6
## period year Type n n_deaths_na deaths
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 2015 Predicted (weighted) 14364 0 2691178
## 2 2015-2019 2016 Predicted (weighted) 14443 0 2723213
## 3 2015-2019 2017 Predicted (weighted) 14406 0 2802009
## 4 2015-2019 2018 Predicted (weighted) 14398 0 2830356
## 5 2015-2019 2019 Predicted (weighted) 14414 0 2844020
## 6 2020 2020 Predicted (weighted) 14835 0 3432937
## 7 2021 2021 Predicted (weighted) 7898 0 1806613
##
##
## Checking variable combination: period Suppress
## # A tibble: 3 x 5
## period Suppress n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-2019 <NA> 72025 0 13890776
## 2 2020 <NA> 14835 0 3432937
## 3 2021 <NA> 7898 0 1806613
##
##
## Checking variable combination: period Note
## # A tibble: 10 x 5
## period Note n n_deaths_na deaths
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 2015-20~ <NA> 72025 0 1.39e7
## 2 2020 Data in recent weeks are incomplete. Only~ 13494 0 3.05e6
## 3 2020 Data in recent weeks are incomplete. Only~ 4 0 1.17e2
## 4 2020 Data in recent weeks are incomplete. Only~ 225 0 1.47e5
## 5 2020 Weighted numbers of deaths are 20% or mor~ 280 0 6.00e4
## 6 2020 Weights may be too low to account for und~ 10 0 5.96e3
## 7 2020 <NA> 822 0 1.73e5
## 8 2021 Data in recent weeks are incomplete. Only~ 7250 0 1.63e6
## 9 2021 Data in recent weeks are incomplete. Only~ 18 0 5.3 e2
## 10 2021 Data in recent weeks are incomplete. Only~ 630 0 1.74e5
##
## *** File has been checked for uniqueness by: cluster year week
##
## Plots will be run after excluding stateNoCheck states
##
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w29.pdf
##
## Returning plot outputs to the main log file
## Joining, by = "state"
##
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w29.pdf
##
## Returning plot outputs to the main log file
CDC data for deaths by age and location available at CDC website are downloaded, cached to avoid multiple hits to the server:
deathAgeLoc <- "./RInputFiles/Coronavirus/COvID_deaths_age_place_20210824.csv"
if (!file.exists(deathAgeLoc)) {
fileDownload(fileName="./RInputFiles/Coronavirus/COvID_deaths_age_place_20210824.csv",
url="https://data.cdc.gov/api/views/4va6-ph5s/rows.csv?accessType=DOWNLOAD"
)
} else {
cat("\nFile already exists, not downloading\n")
}
##
## File already exists, not downloading
The file is then read for a basic exploration:
deathAge_20210824_raw <- fileRead(deathAgeLoc, col_types="cccciiccccddddddc")
glimpse(deathAge_20210824_raw)
## Rows: 100,602
## Columns: 17
## $ `Data as of` <chr> "08/18/2021", "08/18/2021",~
## $ `Start Date` <chr> "01/01/2020", "01/01/2020",~
## $ `End Date` <chr> "08/14/2021", "08/14/2021",~
## $ Group <chr> "By Total", "By Total", "By~
## $ Year <int> NA, NA, NA, NA, NA, NA, NA,~
## $ Month <int> NA, NA, NA, NA, NA, NA, NA,~
## $ `HHS Region` <chr> "0", "0", "0", "0", "0", "0~
## $ State <chr> "United States", "United St~
## $ `Place of Death` <chr> "Total - All Places of Deat~
## $ `Age group` <chr> "All Ages", "0-17 years", "~
## $ `COVID-19 Deaths` <dbl> 614530, 361, 2630, 7501, 19~
## $ `Total Deaths` <dbl> 5296490, 53192, 100227, 143~
## $ `Pneumonia Deaths` <dbl> 557008, 865, 2814, 6900, 17~
## $ `Pneumonia and COVID-19 Deaths` <dbl> 303039, 73, 1163, 3498, 986~
## $ `Influenza Deaths` <dbl> 9232, 188, 148, 323, 501, 2~
## $ `Pneumonia, Influenza, or COVID-19 Deaths` <dbl> 876434, 1341, 4417, 11201, ~
## $ Footnote <chr> NA, NA, NA, NA, NA, NA, NA,~
deathAge_20210824_conv <- deathAge_20210824_raw %>%
colRenamer(vecRename=c("Data as of"="asofDate",
"Start Date"="startDate",
"End Date"="endDate",
"HHS Region"="HHSRegion",
"Place of Death"="deathPlace",
"Age group"="Age",
"COVID-19 Deaths"="covidDeaths",
"Total Deaths"="totalDeaths",
"Pneumonia Deaths"="pneumoDeaths",
"Pneumonia and COVID-19 Deaths"="pneumoCovidDeaths",
"Influenza Deaths"="fluDeaths",
"Pneumonia, Influenza, or COVID-19 Deaths"="pnemoFluCovidDeaths"
)
) %>%
colMutater(selfList=list("asofDate"=lubridate::mdy, "startDate"=lubridate::mdy, "endDate"=lubridate::mdy))
glimpse(deathAge_20210824_conv)
## Rows: 100,602
## Columns: 17
## $ asofDate <date> 2021-08-18, 2021-08-18, 2021-08-18, 2021-08-18, 2~
## $ startDate <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2~
## $ endDate <date> 2021-08-14, 2021-08-14, 2021-08-14, 2021-08-14, 2~
## $ Group <chr> "By Total", "By Total", "By Total", "By Total", "B~
## $ Year <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ Month <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ HHSRegion <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", ~
## $ State <chr> "United States", "United States", "United States",~
## $ deathPlace <chr> "Total - All Places of Death", "Total - All Places~
## $ Age <chr> "All Ages", "0-17 years", "18-29 years", "30-39 ye~
## $ covidDeaths <dbl> 614530, 361, 2630, 7501, 19776, 98973, 137149, 167~
## $ totalDeaths <dbl> 5296490, 53192, 100227, 143051, 212953, 881095, 10~
## $ pneumoDeaths <dbl> 557008, 865, 2814, 6900, 17026, 92781, 130216, 154~
## $ pneumoCovidDeaths <dbl> 303039, 73, 1163, 3498, 9861, 52942, 74134, 85579,~
## $ fluDeaths <dbl> 9232, 188, 148, 323, 501, 2191, 1997, 2003, 1881, ~
## $ pnemoFluCovidDeaths <dbl> 876434, 1341, 4417, 11201, 27371, 140656, 194900, ~
## $ Footnote <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
# Combinations of startDate and endDate
deathAge_20210824_conv %>%
count(asofDate, startDate, endDate) %>%
ggplot(aes(y=startDate, x=endDate)) +
geom_point(aes(size=n)) +
facet_wrap(~asofDate) +
labs(x="Ending Date", y="Starting Date", title="Combinations of Start and End Date")
deathAge_20210824_conv %>%
count(Group, deathPlace, Age) %>%
ggplot(aes(x=Group, y=deathPlace)) +
geom_tile(aes(fill=n)) +
facet_wrap(~Age) +
labs(x="Group", y="Place of Death", title="Combinations of Age, Place of Death, and Group")
deathState <- deathAge_20210824_conv %>%
filter(Group=="By Total", deathPlace=="Total - All Places of Death", Age=="All Ages") %>%
group_by(State) %>%
summarize(across(where(is.numeric), sum, na.rm=TRUE)) %>%
mutate(abb=state.abb[match(State, state.name)])
deathState %>% filter(is.na(abb))
## # A tibble: 4 x 10
## State Year Month covidDeaths totalDeaths pneumoDeaths pneumoCovidDeat~
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 District of~ 0 0 1501 11580 1847 1228
## 2 New York Ci~ 0 0 29547 121838 17718 11098
## 3 Puerto Rico 0 0 2567 49898 6883 1823
## 4 United Stat~ 0 0 614530 5296490 557008 303039
## # ... with 3 more variables: fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # abb <chr>
deathBase <- deathState %>%
select(State, covidDeaths, totalDeaths) %>%
mutate(noncovid=covidDeaths/totalDeaths) %>%
filter(!(State %in% c("United States", "Puerto Rico"))) %>%
pivot_longer(-c(State)) %>%
ggplot(aes(x=fct_reorder(State, value, max), y=value/1000)) +
coord_flip() +
theme(legend.position="bottom")
deathBase +
geom_col(data=~filter(., name=="totalDeaths"), aes(fill="All")) +
geom_col(data=~filter(., name=="covidDeaths"), aes(fill="COVID")) +
scale_fill_manual("Type", breaks=c("COVID", "All"), labels=c("COVID", "All"), values=c("red", "black")) +
labs(title="Deaths 2020-present by state", x=NULL, y="Deaths (000s)")
deathBase +
geom_col(data=~filter(., name=="noncovid"), aes(y=value), position="identity") +
labs(x=NULL, y=NULL, title="Proportion of deaths from COVID")
The data appear to contain monthly totals, with the addition of full-year 2020, YTD 2021, and total 2020-YTD 2021. Totals are provided by age sub-group and overall, place of death category and overall, and monthly, annually, and total.
Total deaths and proportions from COVID appear sensible. Next steps are to continue processing and exploring the data:
# Add the state abbreviation
deathAge_20210824_conv <- deathAge_20210824_conv %>%
mutate(abb=c(state.abb, "DC")[match(State, c(state.name, "District of Columbia"))])
# Function to check that totals match sum of sub-totals
checkSubTotals <- function(df, checkByVars, subVar, subVarTotal, sumVars=NULL, sumFunc=specNA(sum), ...) {
# FUNCTION ARGUMENTS:
# df: data.frame or tibble
# checkByVars: variables that the frame will be checked by
# subVar: variable that is being checked
# subVarTotal: label for the value that is the total of subVar
# sumVars: variables to be summed (NULL means all numeric)
# sumFunc: function to be applied when summing all variables
# ...: any other arguments to pass to summarize(across(all_of(checkByVars), .fns=sumFunc, ...))
# If sumVars is NULL, find the sum variables
if (is.null(sumVars)) sumVars <- df %>% head(1) %>% select_if(is.numeric) %>% names()
# Keep only te desired variables in df
df <- df %>%
select(all_of(c(checkByVars, subVar, sumVars))) %>%
arrange(across(all_of(checkByVars)))
# Split the data frame by subtotal and total
dfTot <- df %>%
filter(get(subVar) == subVarTotal)
dfSub <- df %>%
filter(get(subVar) != subVarTotal) %>%
group_by(across(all_of(checkByVars))) %>%
summarize(across(all_of(sumVars), .fns=sumFunc, ...), .groups="drop") %>%
mutate(fakeCol=subVarTotal) %>%
colRenamer(vecRename=c("fakeCol"=subVar)) %>%
select(names(dfTot))
# Comparison of totals
list(dfSub=dfSub, dfTot=dfTot)
}
checkNumbers <- function(lst, byVars, lstNames=NULL, absTol=100, pctTol=0.05, keyVar="key variable") {
# FUNCTION ARGUMENTS:
# lst: a list with two items that will be checked for similarity
# byVars: by variables that should be identical across the list items
# lstNames: names to use for the list (NULL means use names provided in lst)
# absTol: absolute value of differences to flag
# pctTol: percent tolerance for differences to flag
# keyVar: name for the key variable in plot title
# Check that lst is a list of length 2
if (!("list" %in% class(lst)) | !(length(lst)==2)) stop("\nMust pass a list with two items\n")
# Add names if passed in lstNames, otherwise use names(lst)
if (!is.null(lstNames)) names(lst) <- lstNames
else lstNames <- names(lst)
# Check for identical files using only byVars
if (!isTRUE(identical(lst[[1]][, byVars], lst[[2]][, byVars])))
stop("\nSub-lists differ by byVars, not comparing\n")
else cat("\nSub-lists are identical by:", paste0(byVars, collapse=", "), "\n")
# Check the numeric values
dfDelta <- lapply(lst, FUN=function(x) pivot_longer(x, cols=-all_of(byVars)) %>%
mutate(value=ifelse(is.na(value), 0, value)) %>%
select(all_of(byVars), name, value)
) %>%
purrr::reduce(.f=inner_join, by=c(all_of(byVars), "name")) %>%
mutate(delta=value.x-value.y, pct=ifelse(delta==0, 0, delta/(value.x+value.y))) %>%
purrr::set_names(c(all_of(byVars), "name", all_of(lstNames), "delta", "pct"))
# Plot the differences using name as facet
p1 <- dfDelta %>%
ggplot(aes(x=delta, y=pct)) +
geom_point() +
facet_wrap(~name, scales="free") +
labs(title=paste0("Differences between totals and subtotals on variable: ", keyVar),
x="Difference between total and subtotal",
y="Percentage difference"
)
print(p1)
# Flag significant outliers
dfDelta %>%
filter(abs(delta) >= absTol, abs(pct) >= pctTol) %>%
arrange(-abs(delta)) %>%
print()
}
# Get a list of the possible variables
allCheckVars <- names(deathAge_20210824_conv) %>%
setdiff(deathAge_20210824_conv %>% head(1) %>% select_if(is.numeric) %>% names()) %>%
setdiff(c("Footnote", "abb", "HHSRegion"))
# Test for each variable in allCheckVars
subMap <- c("State"="United States", "Age"="All Ages", "deathPlace"="Total - All Places of Death")
lapply(c("State", "deathPlace", "Age"),
FUN=function(x) deathAge_20210824_conv %>%
select(-Year, -Month) %>%
checkSubTotals(checkByVars=allCheckVars %>% setdiff(x), subVar=x, subVarTotal=unname(subMap[x])) %>%
checkNumbers(byVars=allCheckVars, keyVar=x)
)
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 1,118 x 12
## asofDate startDate endDate Group State deathPlace Age name dfSub
## <date> <date> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2021-08-18 2020-10-01 2020-10-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 205
## 2 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 71
## 3 2021-08-18 2020-11-01 2020-11-30 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 227
## 4 2021-08-18 2020-08-01 2020-08-31 By Mo~ Unite~ Other 0-17~ total~ 116
## 5 2021-08-18 2020-09-01 2020-09-30 By Mo~ Unite~ Decedent's~ 50-6~ pnemo~ 189
## 6 2021-08-18 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 183
## 7 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Healthcare~ 65-7~ fluDe~ 204
## 8 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 65-7~ fluDe~ 317
## 9 2021-08-18 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 73
## 10 2021-08-18 2020-10-01 2020-10-31 By Mo~ Unite~ Decedent's~ 65-7~ pneum~ 143
## # ... with 1,108 more rows, and 3 more variables: dfTot <dbl>, delta <dbl>,
## # pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
##
## Sub-lists are identical by: asofDate, startDate, endDate, Group, State, deathPlace, Age
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
## [[1]]
## # A tibble: 1,118 x 12
## asofDate startDate endDate Group State deathPlace Age name dfSub
## <date> <date> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2021-08-18 2020-10-01 2020-10-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 205
## 2 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 71
## 3 2021-08-18 2020-11-01 2020-11-30 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 227
## 4 2021-08-18 2020-08-01 2020-08-31 By Mo~ Unite~ Other 0-17~ total~ 116
## 5 2021-08-18 2020-09-01 2020-09-30 By Mo~ Unite~ Decedent's~ 50-6~ pnemo~ 189
## 6 2021-08-18 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pnemo~ 183
## 7 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Healthcare~ 65-7~ fluDe~ 204
## 8 2021-08-18 2020-02-01 2020-02-29 By Mo~ Unite~ Total - Al~ 65-7~ fluDe~ 317
## 9 2021-08-18 2020-01-01 2020-01-31 By Mo~ Unite~ Total - Al~ 30-3~ pneum~ 73
## 10 2021-08-18 2020-10-01 2020-10-31 By Mo~ Unite~ Decedent's~ 65-7~ pneum~ 143
## # ... with 1,108 more rows, and 3 more variables: dfTot <dbl>, delta <dbl>,
## # pct <dbl>
##
## [[2]]
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
##
## [[3]]
## # A tibble: 0 x 12
## # ... with 12 variables: asofDate <date>, startDate <date>, endDate <date>,
## # Group <chr>, State <chr>, deathPlace <chr>, Age <chr>, name <chr>,
## # dfSub <dbl>, dfTot <dbl>, delta <dbl>, pct <dbl>
Variables Age and deathPlace appear to be well-aligned between sub-totals and totals, while variable State shows some more significant differences. Next steps are to further research what is contained in State, including alignment to other data sources.
Deaths by state are compared between files, using July 31, 2021 as the cutoff:
# Create summary by state and year-month
death_sum_210824 <- deathAge_20210824_conv %>%
filter(!is.na(Year), !is.na(Month), deathPlace=="Total - All Places of Death", Age=="All Ages") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month))),
abb=c(state.abb, "DC", "US")[match(State, c(state.name, "District of Columbia", "United States"))]
) %>%
select(State, abb, ym, where(is.numeric), -Year, -Month) %>%
pivot_longer(-c(State, abb, ym)) %>%
arrange(State, abb, name, ym) %>%
group_by(State, abb, name) %>%
mutate(cumValue=cumsum(ifelse(is.na(value), 0, value))) %>%
ungroup() %>%
mutate(date=lubridate::ceiling_date(ym, unit="month")-lubridate::days(1))
# Create summary from state-level file
death_daily_210815 <- readFromRDS("cdc_daily_210815")$dfPerCapita %>%
select(date, abb=state, tot_deaths) %>%
mutate(Year=lubridate::year(date), Month=lubridate::month(date)) %>%
group_by(Year, Month) %>%
filter(date==max(date)) %>%
ungroup()
# Create a plot for evolution of United States
death_sum_210824 %>%
filter(abb=="US", name=="covidDeaths", ym <= "2021-07-31") %>%
ggplot(aes(x=date)) +
geom_line(aes(y=cumValue/1000, color="blue"), size=2) +
geom_point(data=summarize(group_by(filter(death_daily_210815, date <= "2021-07-31"), date),
tot_deaths=sum(tot_deaths, na.rm=TRUE)
),
aes(y=tot_deaths/1000, color="green"),
size=3
) +
labs(x="End of month", y="Cumulative Deaths (000)", title="Cumulative COVID Deaths (000) in US by source") +
scale_color_manual("Source", labels=c("Summed\nstates", "Summed\nsubtotals"), values=c("green", "blue"))
Cumulative deaths by month for total US appear consistent across the files. Next steps are to continue exploring for state-level data:
# Create a plot for total by states
death_sum_210824 %>%
filter(abb %in% c(state.abb, "DC"), name=="covidDeaths", date == "2021-07-31") %>%
ggplot() +
geom_col(aes(x=fct_reorder(abb, cumValue), y=cumValue/1000), fill="lightblue") +
geom_point(data=filter(death_daily_210815, date == "2021-07-31"),
aes(x=abb, y=tot_deaths/1000),
size=3
) +
coord_flip() +
labs(x=NULL,
y="Cumulative Deaths (000)",
title="Cumulative COVID Deaths (000) in US as of 2021-07-31",
subtitle="Filled bars are summed subtotals, points are from CDC daily")
# Same plot using merged data
plot_cum0721 <- death_sum_210824 %>%
filter(abb %in% c(state.abb, "DC"), name=="covidDeaths", date == "2021-07-31") %>%
select(abb, cumValue) %>%
inner_join(select(filter(death_daily_210815, date == "2021-07-31"), abb, tot_deaths), by=c("abb")) %>%
mutate(pctdiff=abs(tot_deaths-cumValue)/(tot_deaths+cumValue))
plot_cum0721 %>%
arrange(-pctdiff)
## # A tibble: 51 x 4
## abb cumValue tot_deaths pctdiff
## <chr> <dbl> <dbl> <dbl>
## 1 NY 25579 53524 0.353
## 2 MA 13713 18082 0.137
## 3 DC 1500 1149 0.133
## 4 NE 2963 2280 0.130
## 5 MO 12003 9667 0.108
## 6 GA 18335 21683 0.0837
## 7 OK 8845 7515 0.0813
## 8 AK 327 382 0.0776
## 9 WY 672 776 0.0718
## 10 ND 1766 1539 0.0687
## # ... with 41 more rows
plot_cum0721 %>%
summarize(across(where(is.numeric), sum))
## # A tibble: 1 x 3
## cumValue tot_deaths pctdiff
## <dbl> <dbl> <dbl>
## 1 581194 609079 2.27
plot_cum0721 %>%
ggplot(aes(x=fct_reorder(abb, cumValue))) +
geom_col(aes(y=cumValue/1000), fill="lightblue") +
geom_point(aes(y=tot_deaths/1000), size=3) +
coord_flip() +
labs(x=NULL,
y="Cumulative Deaths (000)",
title="Cumulative COVID Deaths (000) in US as of 2021-07-31",
subtitle="Filled bars are summed subtotals, points are from CDC daily"
)
The New York City data will need to be added to NY for further analysis. There are some surprising differences in total deaths reported by state, even as total deaths (after adding Nyc) are nearly identical between the files.
Breakdown of deaths by age is also explored:
deathAllData <- deathAge_20210824_conv %>%
filter(deathPlace=="Total - All Places of Death")
deathAllData
## # A tibble: 11,178 x 18
## asofDate startDate endDate Group Year Month HHSRegion State deathPlace
## <date> <date> <date> <chr> <int> <int> <chr> <chr> <chr>
## 1 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 2 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 3 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 4 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 5 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 6 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 7 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 8 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 9 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 10 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 4 Alab~ Total - A~
## # ... with 11,168 more rows, and 9 more variables: Age <chr>,
## # covidDeaths <dbl>, totalDeaths <dbl>, pneumoDeaths <dbl>,
## # pneumoCovidDeaths <dbl>, fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # Footnote <chr>, abb <chr>
# Proportions of death by age and cause
deathAllData %>%
filter(State=="United States", Age != "All Ages", Group=="By Total") %>%
select(Age, where(is.numeric), -Year, -Month) %>%
pivot_longer(-Age) %>%
ggplot() +
geom_col(aes(x=name, y=value, fill=fct_rev(Age)), position="fill") +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by cause (2020-August 2021)") +
scale_fill_discrete("Age")
# Proportions of death by age and month
deathAllData %>%
filter(State=="United States", Age != "All Ages", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(Age, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(Age, ym)) %>%
ggplot() +
geom_col(aes(x=ym, y=value, fill=fct_rev(Age)), position="fill") +
facet_wrap(~name) +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by age and cause (2020-August 2021)") +
scale_fill_discrete("Age")
# Total death by age and month
deathAllData %>%
filter(State=="United States", Age != "All Ages", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(Age, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(Age, ym)) %>%
filter(ym != "2021-08-01") %>%
ggplot() +
geom_line(aes(x=ym, y=value, color=fct_rev(Age), group=Age)) +
facet_wrap(~name, scales="free_y") +
labs(x=NULL, y="Proportion of Deaths", title="Deaths by age and cause (2020-July 2021)") +
scale_color_discrete("Age")
There are very few reported flu deaths in the 2020-2021 data. The change in covidDeaths by age over time appears to be at most a minor driver of the change in totalDeaths by age over time. This is consistent with covidDeaths being in the 10%-20% range of totalDeaths, distributed by age (to a first order) in a somewhat similar pattern.
A similar process is run for place of death:
deathPlaceData <- deathAge_20210824_conv %>%
filter(Age == "All Ages")
deathPlaceData
## # A tibble: 11,178 x 18
## asofDate startDate endDate Group Year Month HHSRegion State deathPlace
## <date> <date> <date> <chr> <int> <int> <chr> <chr> <chr>
## 1 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Total - A~
## 2 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Healthcar~
## 3 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Healthcar~
## 4 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Healthcar~
## 5 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Decedent'~
## 6 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Hospice f~
## 7 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Nursing h~
## 8 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Other
## 9 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 0 Unit~ Place of ~
## 10 2021-08-18 2020-01-01 2021-08-14 By T~ NA NA 4 Alab~ Total - A~
## # ... with 11,168 more rows, and 9 more variables: Age <chr>,
## # covidDeaths <dbl>, totalDeaths <dbl>, pneumoDeaths <dbl>,
## # pneumoCovidDeaths <dbl>, fluDeaths <dbl>, pnemoFluCovidDeaths <dbl>,
## # Footnote <chr>, abb <chr>
# Proportions of death by place and cause
deathPlaceData %>%
filter(State=="United States", deathPlace!="Total - All Places of Death", Group=="By Total") %>%
select(deathPlace, where(is.numeric), -Year, -Month) %>%
pivot_longer(-deathPlace) %>%
ggplot() +
coord_flip() +
geom_col(aes(x=name, y=value, fill=fct_rev(deathPlace)), position="fill") +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by place (2020-August 2021)") +
scale_fill_discrete("Death\nPlace") +
theme(legend.position="bottom")
# Proportions of death by place and month
deathPlaceData %>%
filter(State=="United States", deathPlace!="Total - All Places of Death", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(deathPlace, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(deathPlace, ym)) %>%
ggplot() +
geom_col(aes(x=ym, y=value, fill=fct_rev(deathPlace)), position="fill") +
facet_wrap(~name) +
labs(x=NULL, y="Proportion of Deaths", title="Proportion of deaths by place and cause (2020-August 2021)") +
scale_fill_discrete("Death\nPlace") +
theme(legend.position="bottom")
# Total death by place and month
deathPlaceData %>%
filter(State=="United States", deathPlace!="Total - All Places of Death", Group=="By Month") %>%
mutate(ym=lubridate::ym(paste0(Year, "-", zeroPad2(Month)))) %>%
select(deathPlace, ym, totalDeaths, covidDeaths, fluDeaths) %>%
pivot_longer(-c(deathPlace, ym)) %>%
filter(ym != "2021-08-01") %>%
ggplot() +
geom_line(aes(x=ym, y=value, color=fct_rev(deathPlace), group=deathPlace)) +
facet_wrap(~name, scales="free_y") +
labs(x=NULL, y="Proportion of Deaths", title="Deaths by place and cause (2020-July 2021)") +
scale_color_discrete("Death\nPlace")
Relative to overall deaths, COVID deaths appear more prevalent in the inpatient healthcare setting or nursing home and less prevalent at home. The proportion has moved away from nursing homes and towards inpatient (hospital) as the pandemic progressed.